path <- '/Users/nvribeiro/Library/CloudStorage/OneDrive-UMCG/Desktop/PhD/Others/BMS-BigData-Course-2025' # you need to change this to your own path
setwd(path)Single-cell RNA-seq data analysis
Introduction
Single-cell RNA-seq data analysis usually follows the steps shown in the figure below. The first step is the generation of the count matrices using the raw sequencing data. After that we start the quality control steps that can include removal of doublets, ambient RNA correction, and removal of low quality cells (i.e. high mitochondrial content (dying cells) and low number of reads (empty droplets)). Once we have a filtered and high quality dataset, we proceed to the pre-processing steps that include normalization, feature selection, dimensionality reduction, and clustering of the cells (explained with more details later).
For the sake of simplicity and time, we will skip the quality control steps that precede any analysis done with scRNA-seq data. If you want to know more about the quality control steps you can check here and here. Therefore, here we will use a dataset that has already been quality controlled and filtered.
Initial set-up
Set working directory
In this step, you set your working directory, that is, the folder in your computer where you are going to save your scripts, data and results.
Load libraries
If you haven’t done, check the file “Preparation for data analysis – Big Data Course BMS 2025” and follow the instructions to install/update R, RStudio and the packages that you will used in the analysis. Then load the libraries.
library(Seurat)
library(dplyr)
library(tidyr)
library(ggplot2)
library(patchwork)
library(stringr)
library(MAST)
library(ggrepel)
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)
library(knitr)
library(presto)
library(forcats)# Optional: define a default theme to be used in all plots
default_theme <- function(){
theme_bw() +
theme(panel.grid = element_blank(),
panel.border = element_rect(linewidth = 1),
axis.title = element_text(size = rel(1)),
axis.title.y.left = element_text(margin = margin(r = 10, unit = 'pt')),
axis.title.y.right = element_text(margin = margin(l = 10, unit = 'pt')),
axis.title.x.bottom = element_text(margin = margin(t = 10, unit = 'pt')),
plot.title = element_text(size = rel(1), hjust = 0.5, face = 'bold'),
axis.text = element_text(size = rel(1)),
axis.ticks.length = unit(5, 'pt'),
strip.background = element_rect(fill = "#f5a9a9", linewidth = 0),
strip.text = element_text(size = rel(1), color = "black")
)
}Loading the tutorial data
We will use a small subset of 10,000 single-cells from small intestine biopsies of pediatric samples from the Gut Cell Atlas by Elmentaite at al. (2021). This dataset is included in the files you downloaded previously and is called “Elmentaite_2021_P_smallInt_BMS_tutorial.rds”.
obj <- readRDS("data/Elmentaite_2021_P_smallInt_BMS_tutorial.rds")
objAn object of class Seurat
33538 features across 10000 samples within 1 assay
Active assay: RNA (33538 features, 0 variable features)
2 layers present: counts, data
The code above loads the dataset and tells us that this is a Seurat object with 10,000 cells and 33,538 features (genes), and we are working with a RNA assay. Now we can start pre-processing this data and run some analysis.
Pre-processing
Single-cell RNA-sequencing is prone to many different sources of variance, for example cells can have very different numbers of gene counts because of differences in the amount of mRNA present in them or purely randomly during the sequencing. Normalization of counts make the transcriptome profile of all cells comparable among them. During this process, you also want to remove unwanted sources of variation, for example genes involved in cell cycle or mitochondrial genes. If you want to know more about normalization, you can read this.
Moreover, the counts matrix is huge and sparse (contains a lot of zeros), which means that not every detected gene is biologically informative. The features selection step is useful to reduce the computational power required and make sure the analysis focus only on informative genes. These genes ideally explain the biological variation in the data by prioritizing genes the vary between subpopulations, instead of within a subpopulation. These genes are then used to perform dimensionality reduction, which in this case is a principal component analysis (PCA). Using the PCA dimensions, we can then group cells into clusters based on how similar they are in this multidimensional space and visualize them in a UMAP representation.
The code below takes care of perform all the steps above. The resolution parameter in the function FindClusters() below is the one you need to pay more attention and adjust according to your dataset. A large value will lead to more clusters, and a small one, to fewer. The best way to guess the resolution parameter is to have some idea of how many cell types you expect in your sample based on the tissue you are working it, and then try different values until you find one that gives you a reasonable number of clusters. In this case, we are working with small intestine, a tissue with many different cell types, so we expect to have many different clusters. In the original publication, they have hundreds of different cell types but because we are working only with a small subset of the data and a low resolution to speed thing up a bit, so we will have much less.
options(future.globals.maxSize = 8000 * 1024^2)
obj <- obj %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData(vars.to.regress = "pct_counts_mt") %>% # Regresses out the percentage of mitochondrial RNA
RunPCA() %>%
FindNeighbors(dims = 1:50) %>%
FindClusters(resolution = 0.3) %>%
RunUMAP(dims = 1:50)Warning: The `slot` argument of `SetAssayData()` is deprecated as of SeuratObject 5.0.0.
ℹ Please use the `layer` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
Please report the issue at <https://github.com/satijalab/seurat/issues>.
Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
ℹ Please use the `layer` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
Please report the issue at <https://github.com/satijalab/seurat/issues>.
Regressing out pct_counts_mt
Centering and scaling data matrix
PC_ 1
Positive: IGFBP7, CALD1, IFITM3, SPARC, COL1A2, COL3A1, A2M, C1S, RARRES2, COL6A2
SPARCL1, COL1A1, COL6A1, C1R, DCN, CXCL14, MFAP4, LUM, LGALS1, MMP2
TIMP1, TM4SF1, COL4A2, COX7A1, NUPR1, IGFBP4, TPM2, ADAMDEC1, MYL9, GPX3
Negative: IGHM, TCL1A, HLA-DQA1, DUSP2, KLRB1, TRBC1, CCL5, HMGA1, LRMP, PIM2
HLA-DPB1, MZB1, IGKC, GPR183, HLA-DPA1, SRGN, LINC01857, RGS13, NKG7, LINC01871
FAM30A, SPIB, ITGB2, IL7R, CST7, IGLC3, IGLC2, DERL3, TNFRSF17, TRGC2
PC_ 2
Positive: ALDOB, FABP2, ANPEP, PRAP1, GUCA2A, DPEP1, FABP1, PHGR1, AMN, CDHR5
SMIM24, APOB, C19orf33, MEP1A, RBP2, SI, EPCAM, CES2, MTTP, PCK1
ACE, C3orf85, FABP6, AOC1, ENPEP, MGAM, KRT19, SERPINA1, CDHR2, MYO1A
Negative: VIM, CALD1, LGALS1, IGFBP7, COL6A2, RARRES2, COL1A2, COL3A1, C1S, SPARC
COL6A1, A2M, COL1A1, TIMP1, DCN, C1R, MMP2, CXCL14, TPM2, NUPR1
SPARCL1, LUM, IFITM3, MFAP4, TCF4, SELENOM, C11orf96, IGFBP5, SOD3, SERPING1
PC_ 3
Positive: PLVAP, ADGRL4, PCAT19, VWF, RAMP2, RAMP3, ECSCR, CYYR1, EGFL7, CLDN5
CD34, JAM2, CDH5, PECAM1, FLT1, MMRN2, ESAM, CD93, PCDH17, PTPRB
TM4SF18, APLNR, PODXL, CALCRL, CD320, SOX18, ROBO4, FAM110D, MYCT1, TMEM88
Negative: DCN, COL3A1, COL1A2, LUM, MFAP4, CXCL14, C1S, ADAMDEC1, FBLN1, RARRES2
TCF21, COL1A1, CFD, EMILIN1, COL6A3, CCL11, MEG3, COL6A2, MMP2, CTSK
APOE, PDGFRA, C1R, CXCL1, CYGB, COL6A1, SPON2, HAPLN1, ABCA8, EDIL3
PC_ 4
Positive: AGR2, KRT18, OLFM4, GPX2, PIGR, PPP1R1B, SLC12A2, CEACAM5, TFF3, REG4
LGALS4, FCGBP, CLCA1, SPINK1, MUC2, DMBT1, ST6GALNAC1, SPINK4, STARD10, CDC42EP5
REG1A, KRT8, CREB3L1, ITLN1, BCAS1, MGST1, ADH1C, HEPACAM2, FOXA3, SPDEF
Negative: APOB, CYP3A4, SLC6A19, APOA4, ACE, SLC15A1, CUBN, APOC3, ALPI, APOA1
CREB3L3, GUCA2A, XPNPEP2, MGAM, MTTP, C3orf85, DPEP1, PRAP1, ANPEP, CDHR2
GUCA2B, SLC5A12, SLC26A3, SMIM24, SULT1A2, IL32, NAALADL1, CPO, MEP1B, MS4A10
PC_ 5
Positive: PCLAF, MKI67, TOP2A, STMN1, UBE2C, HMGB2, BIRC5, NUSAP1, AURKB, RRM2
TYMS, PTTG1, CDK1, CDKN3, CENPF, TK1, CCNA2, GTSE1, TPX2, CDC20
TUBB, RGS13, CENPA, CKS2, MND1, HMMR, ASPM, MYBL2, H2AFZ, CDT1
Negative: CLCA1, MUC2, FCGBP, REG4, BCAS1, TFF3, SPINK4, ZG16, HEPACAM2, REP15
S100P, CEACAM5, ATOH1, MLPH, SPDEF, CREB3L1, TFF1, SCNN1A, ITLN1, FOXA3
ST6GALNAC1, LRRC26, CEACAM6, STARD10, ZG16B, FXYD3, TPSG1, NPDC1, TSPAN1, AC005833.1
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 10000
Number of edges: 433662
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9557
Number of communities: 21
Elapsed time: 0 seconds
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
15:03:27 UMAP embedding parameters a = 0.9922 b = 1.112
15:03:27 Read 10000 rows and found 50 numeric columns
15:03:27 Using Annoy for neighbor search, n_neighbors = 30
15:03:27 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:03:27 Writing NN index file to temp file /var/folders/x6/85llj18n3j18wmy0ylhx6pdm0000gn/T//RtmpEx0dZs/filee68e549b53be
15:03:27 Searching Annoy index using 1 thread, search_k = 3000
15:03:29 Annoy recall = 100%
15:03:29 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
15:03:30 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
15:03:30 Using 'irlba' for PCA
15:03:30 PCA: 2 components explained 32.73% variance
15:03:30 Scaling init to sdev = 1
15:03:30 Commencing optimization for 500 epochs, with 438422 positive edges
15:03:30 Using rng type: pcg
15:03:37 Optimization finished
DimPlot(obj, reduction = "umap")Inspecting the clustering
Large experiments usually are done in batches and that can be a cofounder that we don’t want to include in our analysis. If there is a strong batch effect, cells from the same cell type will cluster together based on their batch and not on their true identity. A quick way to check if you have a strong batch effect in your data is to plot the UMAP above but colored (or split) by batch. (You can also do this with any other variable that you have in your metadata)
DimPlot(obj, reduction = "umap", group.by = "batch")# If you want to try the split version, un-comment (delete the '#") and run the line below
# DimPlot(obj, reduction = "umap", split.by = "batch") + NoLegend()Identifying the clusters
Since we don’t see major batch effects, we can continue our analysis. Now, we want to identify the clusters. What cell types are they? To answer this question, we need to find the genes that define each cluster, or marker genes. These are genes that are differentially expressed in that cluster specifically when compared to all other cells. This is done using the code below. The function FindAllMarkers() performs a differential expression testing for each clusters versus all other cells, therefore, it might take a few minutes to run, depending on your computer’s specs. (You can skip this step if it takes too long or if your machine can’t handle it and go to to the cell identification step).
marker_genes <- FindAllMarkers(obj,
only.pos = TRUE, # return only the positive differentially expressed genes
min.pct = 0.1, # only test genes that are detected in at least 10% of the cells in either cluster
logfc.threshold = 0.25) # the logFC of a gene between the two groups has to be at least 0.25This gives you a large table with all the results. To make it easier to look at the markers, let’s reshape this and display only the top 10 most significant marker genes for each cluster.
# Getting the top 10 markers per cluster based on p-val and FC
top.markers <- marker_genes %>%
group_by(cluster) %>%
filter(p_val_adj < 0.05) %>%
arrange(p_val_adj, by.group = TRUE) %>%
top_n(n = 10, wt = avg_log2FC)
# Reshaping the table
top.markers.clean <- top.markers %>%
dplyr::select(cluster, gene) %>%
group_by(cluster) %>%
summarise(markers = str_c(unlist(pick(gene)), collapse=', '))
knitr::kable(top.markers.clean)| cluster | markers |
|---|---|
| 0 | LINC00926, BANK1, FCER2, LINC02397, IGHD, GAPT, HVCN1, FAM129C, CD1C, TNFRSF13B |
| 1 | CCL5, KLRD1, TRGC2, CD160, GZMA, XCL1, XCL2, TRGC1, KLRC2, GZMK |
| 2 | APOB, APOA4, APOC3, APOA1, SLC5A12, MEP1B, SLC13A1, SLC2A2, ENPP7, FADS6 |
| 3 | TIGIT, CD40LG, LEF1, ICOS, CTLA4, CHRM3-AS2, TOX2, CD28, TBC1D4, TNFRSF4 |
| 4 | LUM, ADAMDEC1, CCL11, CXCL6, HAPLN1, FNDC1, CCL8, CCL13, SFTA1P, OGN |
| 5 | RRM2, CDK1, CLSPN, MND1, GTSE1, CENPA, CDCA5, SPC25, DLGAP5, FDCSP |
| 6 | OLFM4, GPX2, PLA2G2A, LGR5, GP2, REG3A, ITLN2, PRSS2, DEFA6, DEFA5 |
| 7 | IGHA2, JCHAIN, LINC02362, AL391056.1, ENAM, AMPD1, IGLL5, IGHA1, LINC00582, FAM92B |
| 8 | FPR3, LILRB2, CD209, FCN1, CD163, LILRB5, SIGLEC1, S100A12, CD300E, TLR8 |
| 9 | RGS13, SERPINA9, SUGCT, BORCS8-MEF2B, HTR3A, WDR66, CAMP, HRK, LINC00877, MIR3681HG |
| 10 | MUC2, CLCA1, FCGBP, ZG16, REP15, TFF1, ATOH1, CAPN9, B3GNT6, SYTL5 |
| 11 | PLVAP, MMRN2, SOX18, FAM110D, PALMD, SOX17, SELE, LCN6, FCN3, CADM3-AS1 |
| 12 | HLA-DQA2, MTRNR2L1, LINC00926, BANK1, IGHD, LINC02397, GBP4, FCRL1, C12orf42, HAPLN3 |
| 13 | IGHG2, IGHGP, IGHG3, IGHG4, IGHG1, IGLV3-1, IGLV6-57, JSRP1, IGKC, IGLC3 |
| 14 | ALKAL2, ENHO, POSTN, NPY, VEGFD, GLP2R, LY6H, VSTM2A, COL4A6, LINC01915 |
| 15 | NOTCH3, COX4I2, HIGD1B, FAM162B, FOXS1, HEYL, FHL5, IL17B, AC018647.1, AVPR1A |
| 16 | ACTG2, HHIP, SOSTDC1, HSD17B6, DES, MYOCD, SLC30A8, LUZP2, CCDC144NL-AS1, ALKAL1 |
| 17 | NRXN1, CDH19, GFRA3, PCSK2, LRRTM1, SOX2, FOXD3, PTPRZ1, LINC01505, LINC00461 |
| 18 | LYVE1, MMRN1, C6orf141, RELN, PLIN5, AC007998.3, LINC02308, STAB2, AC011498.4, GPR1 |
| 19 | TPSAB1, CPA3, TPSB2, MS4A2, KRT1, HDC, ADCYAP1, TPSD1, AL662860.1, LINC00323 |
| 20 | CRYBA2, SCGN, SCG2, MIR7-3HG, MARCH4, CDH22, AC005256.1, LCN15, PYY, NTS |
With the table above and a reference gene markers list, or a very good knowledge of biology, or a lot of Googling, you can identify each cluster. Luckly for us, this dataset already has the cell type annotated in the metadata, so we will use that to help us. Let’s first see how the clusters are distributed in the original cell type annotation.
ggplot(obj@meta.data, aes(x = seurat_clusters, y = cell_type)) +
geom_count() +
theme_classic()As mentioned above, we have way less clusters and they are not as good defined as in the original publication, because we are working with just a small part of it. So, we are adding the following simplified annotation based on the overlap seen in the plot above:
| Seurat cluster | Cell type |
|---|---|
| 0, 9, 12 | B cells |
| 1 | Activated T cells |
| 2 | Enterocytes |
| 3 | T cells |
| 4, 14, 15, 16, 18 | Stromal cells |
| 5, 7, 13 | Plasma cells |
| 8 | Macrophages |
| 6 | Stem and TA cells |
| 10 | Goblet cells |
| 19 | Mast cells |
| 11 | Endothelial cells |
| 17 | Glia |
| 20 | EECs |
We will add this new cell type annotation as a new column in the metadata called cell_type_new. Then, we can visualize the UMAP with this new annotation.
obj@meta.data <- obj@meta.data %>%
mutate(cell_type_new = case_when(
seurat_clusters %in% c(0,9,12) == TRUE ~ "B cells",
seurat_clusters %in% c(1) == TRUE ~ "Activated T cells",
seurat_clusters %in% c(2) == TRUE ~ "Enterocytes",
seurat_clusters %in% c(3) == TRUE ~ "T cells",
seurat_clusters %in% c(4, 14, 15, 16, 18) == TRUE ~ "Stromal cells",
seurat_clusters %in% c(5, 7, 13) == TRUE ~ "Plasma cells",
seurat_clusters %in% c(8) == TRUE ~ "Macrophages",
seurat_clusters %in% c(6) == TRUE ~ "Stem and TA cells",
seurat_clusters %in% c(10) == TRUE ~ "Goblet cells",
seurat_clusters %in% c(19) == TRUE ~ "Mast cells",
seurat_clusters %in% c(11) == TRUE ~ "Endothelial cells",
seurat_clusters %in% c(17) == TRUE ~ "Glia",
seurat_clusters %in% c(20) == TRUE ~ "EECs"
))
DimPlot(obj, group.by = "cell_type_new", label = T) + NoLegend() # this puts the labels in the plot instead of a legend on the sideNow that we have identified our cell types, we can perform some analysis.
When you do your own analysis, you will have to actually identify the cell types yourself. To get help with that, check the “Cell type indentification” tutorial.
Downstream analyses
Differential expression analysis
In this dataset we have samples from individuals with Crohn’s Disease and healthy individuals, so we can investigate if there is a difference in gene expression of the cell types we identified because of an individual being sick. Let’s first just check if cells are present in both conditions by plotting the UMAP split by condition.
p <- DimPlot(obj, group.by = "cell_type_new", split.by = "Diagnosis")
print(p)Now we can start the differential testing. We are using a method called MAST, that has been shown to perform well in single-cell data and allows us to correct for covariates such as batch, sex, age, and the cellular detection rate (the fraction of genes that are detected/expressed in each cell).
# Choose the cell type for that you want to test for
cluster <- "Enterocytes"
obj_test <- subset(obj, subset = cell_type_new == cluster)
# Set the dataset to test control vs disease
Idents(obj_test) <- "Diagnosis"
# Calculating CDR (cellular detection rate) to add in the model and correct for it
CDR <- scale(colSums(GetAssayData(obj_test, assay = "RNA", layer = "data") > 0))
obj_test$CDR <- CDR
# Finding DEGs - note how CDR is included in the model via the latent.vars parameter
DEG <- FindMarkers(obj_test, ident.1 = 'Pediatric Crohn Disease', ident.2 = 'Pediatric healthy', test.use = 'MAST', logfc.threshold = 0,
min.pct = 0.25, latent.vars = 'CDR')
Done!
Combining coefficients and standard errors
Calculating log-fold changes
Calculating likelihood ratio tests
Refitting on reduced model...
Done!
# Create a dummy variable for gene names - it will be useful for plotting
DEG$gene <- rownames(DEG)
# Add the classification if a gene is up or down regulated based on the defined cutoffs for log2FC and adjusted p-value
DEG$DE <- NA
threshold <- 0.5 # cutff for logFC
DEG <- DEG %>%
mutate(DE = case_when(
p_val_adj < 0.05 & avg_log2FC > threshold ~ 'upregulated',
p_val_adj < 0.05 & avg_log2FC < -threshold ~ 'downregulated',
p_val_adj > 0.05 | abs(avg_log2FC) < threshold ~ 'not DE'),
DE_gene = if_else(DE=='not significant', NA, gene)
)
# Order and print the most upregulated genes
DEG <- DEG %>%
arrange(desc(avg_log2FC)) # to see the most downregulated genes, just remove the desc()
knitr::kable(head(DEG))| p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | gene | DE | DE_gene | |
|---|---|---|---|---|---|---|---|---|
| FOLH1 | 0 | 2.534950 | 0.342 | 0.090 | 0e+00 | FOLH1 | upregulated | FOLH1 |
| HRCT1 | 0 | 1.569167 | 0.263 | 0.165 | 8e-07 | HRCT1 | upregulated | HRCT1 |
| RARRES1 | 0 | 1.451417 | 0.286 | 0.209 | 1e-06 | RARRES1 | upregulated | RARRES1 |
| HLA-DPB1 | 0 | 1.422872 | 0.603 | 0.327 | 0e+00 | HLA-DPB1 | upregulated | HLA-DPB1 |
| MTRNR2L8 | 0 | 1.328422 | 0.765 | 0.590 | 0e+00 | MTRNR2L8 | upregulated | MTRNR2L8 |
| SLC52A1 | 0 | 1.284953 | 0.367 | 0.232 | 2e-07 | SLC52A1 | upregulated | SLC52A1 |
We can now visualize the DEG results as a volcano plot and highlight some of the genes using the code below.
color_map <- c("not DE" = "gray", "downregulated" = "#2166ac", "upregulated" = "#b2182b")
ggplot(DEG, aes(x = avg_log2FC, y = -log10(p_val_adj), color = DE, label = gene)) +
geom_jitter() +
geom_text_repel(max.overlaps = 5, force = 30, show.legend = FALSE) +
geom_hline(yintercept = -log10(0.05), colour = '#696969', linetype = 'dashed') +
geom_vline(xintercept = -threshold, colour = '#696969', linetype = 'dashed') +
geom_vline(xintercept = threshold, colour = '#696969', linetype = 'dashed') +
scale_color_manual('', values = color_map) +
labs(title = "DEGs in Enterocytes in Chron's Disease") +
default_theme()Warning: ggrepel: 2296 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
Pathway enrichment analysis
The next question we can ask is if the up or downregulated genes are involved in common functions, like we did in the bulk RNA-seq analysis. To keep it simple, we will do just a GSEA here, but during your analysis, feel free to try the other methods, the idea is the same.
# Create the rannked gene list required for the function, this will be the DEGs ordered by log2FC
geneList_df <- DEG %>%
filter(DE != "not significant") %>%
dplyr::select(gene, avg_log2FC, p_val_adj) %>%
arrange(desc(avg_log2FC))
geneList <- geneList_df$avg_log2FC
names(geneList) <- geneList_df$gene
gsea <- gseGO(
geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "BP",
keyType = "SYMBOL",
minGSSize = 10,
pvalueCutoff = 0.05
)This returns an object with the results that we can inspect with head(gsea@result). Although this table is good to inspect the results in detail, it is not a great way to visualize the results. For that, we can plot the most significant enriched and suppressed pathways. (In this case we only have significant suppressed pathways, so the plot will only have those, but you get the idea).
# This gets the top 5 enriched and supressed pathways based on the Normalized Enrichment Score (NES)
tmp <- arrange(gsea, desc(abs(NES))) %>%
group_by(sign(NES)) %>%
slice(1:5)
# Now we plot them
ggplot(tmp, showCategory = 10,
aes(NES, fct_reorder(Description,NES), fill = -log10(p.adjust))) +
geom_col() +
geom_vline(xintercept = 0, colour = '#696969', linetype = 'dashed') +
scale_fill_gradientn(colours=c("#b3eebe", "#46bac2", "#371ea3"),
guide=guide_colorbar(reverse=TRUE)) +
labs(x = 'Normalized Enrichment Score',
y = '') +
default_theme() +
theme(text = element_text(size = 11))Conclusion
This covers the basics of pre-processing a single-cell RNA-seq dataset and doing some downstream analysis to answer biological questions. Feel free to play with this code and try some different clustering parameter, use different cell types for the differential testing, etc. When you feel ready, run your own analysis with the assignment dataset.